Simulation Studies for Evaluating ForestSearch Performance
Operating Characteristics and Power Analysis
1 Introduction
This vignette demonstrates how to conduct simulation studies to evaluate the performance of ForestSearch for identifying subgroups with differential treatment effects. The simulation framework allows you to:
- Generate synthetic clinical trial data with known treatment effect heterogeneity
- Evaluate subgroup identification rates (power)
- Assess classification accuracy (sensitivity, specificity, PPV, NPV)
- Compare different analysis methods (ForestSearch, GRF)
- Estimate Type I error under null hypothesis
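The classification metrics above are ordinary confusion-matrix quantities obtained by comparing true subgroup membership with the identified subgroup. As a reference, a minimal base-R sketch (classification_metrics is an illustrative helper, not a package export):

```r
# Compute sensitivity, specificity, PPV, NPV from true vs. estimated
# subgroup membership (both 0/1 vectors of equal length)
classification_metrics <- function(truth, estimate) {
  tp <- sum(truth == 1 & estimate == 1)
  fn <- sum(truth == 1 & estimate == 0)
  fp <- sum(truth == 0 & estimate == 1)
  tn <- sum(truth == 0 & estimate == 0)
  c(sens = tp / (tp + fn),  # P(identified | truly in H)
    spec = tn / (tn + fp),  # P(not identified | truly in Hc)
    ppv  = tp / (tp + fp),  # P(truly in H | identified)
    npv  = tn / (tn + fn))  # P(truly in Hc | not identified)
}

# Toy example: 4 true members, 6 non-members
truth    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
estimate <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
classification_metrics(truth, estimate)
# sens = 0.75, spec = 0.8333, ppv = 0.75, npv = 0.8333
```

In the simulations below, truth would be flag.harm and estimate the indicator of the subgroup returned by the search.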
1.1 Simulation Framework Overview
The simulation workflow consists of four main steps:
- Create DGM: Define a data generating mechanism with specified treatment effects
- Simulate Trials: Generate multiple simulated datasets
- Run Analyses: Apply ForestSearch (and optionally GRF) to each dataset
- Summarize Results: Aggregate operating characteristics across simulations
1.2 Key Updates in This Version
The simulation framework has been aligned with generate_aft_dgm_flex() methodology:
| Feature | Description |
|---|---|
| Individual Potential Outcomes | theta_0, theta_1, loghr_po columns |
| Average Hazard Ratios (AHR) | Alternative to Cox-based HR estimation |
| Stacked PO for Cox HR | Same epsilon for causal inference |
| use_twostage Parameter | Faster exploratory analysis option |
| Backward Compatible | Works with old and new DGM formats |
2 Setup
# Core packages
library(forestsearch)
library(weightedsurv)
library(data.table)
library(survival)
library(ggplot2)
library(gt)
# Parallel processing
library(foreach)
library(doFuture)
library(future)
# Source simulation functions (if not yet in package)
# source("sim_aft_gbsg_refactored.R")
# source("oc_analyses_gbsg.R")
3 Creating a Data Generating Mechanism
The simulation framework uses the German Breast Cancer Study Group (GBSG) dataset as a template for realistic covariate distributions and censoring patterns.
3.1 Understanding the DGM
The create_gbsg_dgm() function creates a data generating mechanism (DGM) based on an Accelerated Failure Time (AFT) model with Weibull distribution. Key features:
- Covariates: Age, estrogen receptor, menopausal status, progesterone receptor, nodes
- Treatment effect heterogeneity: Specified via interaction terms
- Subgroup definition: H = {low estrogen receptor AND premenopausal}
- Censoring: Weibull or uniform censoring model
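The Weibull AFT construction can be sketched in a few lines of base R. This is an illustration only: the coefficients, covariate, and censoring bounds below are made up, whereas the actual DGM uses GBSG-derived covariates and parameters.

```r
set.seed(1)
n <- 500
z <- rbinom(n, 1, 0.3)            # binary effect modifier (illustrative)

# Weibull AFT: log(T) = mu + beta*z + sigma*W, where W = log(E), E ~ Exp(1),
# is a standard minimum extreme-value draw, so T is Weibull-distributed
sigma <- 0.8
t_event <- exp(3.0 - 0.5 * z + sigma * log(rexp(n)))

# Uniform censoring alternative (cf. cens_type = "uniform")
t_cens <- runif(n, 0, 120)

y     <- pmin(t_event, t_cens)            # observed follow-up time
event <- as.integer(t_event <= t_cens)    # 1 = event, 0 = censored
mean(event)                               # realized event fraction
```

Treatment effect heterogeneity enters by letting the coefficient on treatment depend on the subgroup indicator, which is what the k_inter parameter controls below.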
3.1.1 New Output Structure (Aligned with generate_aft_dgm_flex)
The DGM now includes:
dgm$hazard_ratios <- list(
overall = hr_causal, # Cox-based overall HR
AHR = AHR, # Average HR from loghr_po
AHR_harm = AHR_H_true, # AHR in harm subgroup
AHR_no_harm = AHR_Hc_true, # AHR in complement
harm_subgroup = hr_H_true, # Cox-based HR in H
no_harm_subgroup = hr_Hc_true # Cox-based HR in Hc
)
The super-population data (dgm$df_super_rand) now contains:
| Column | Description |
|---|---|
| theta_0 | Log-hazard contribution under control |
| theta_1 | Log-hazard contribution under treatment |
| loghr_po | Individual causal log hazard ratio (theta_1 - theta_0) |
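The AHR quantities reported by the DGM follow directly from these columns: AHR is the exponentiated mean of loghr_po, i.e. the geometric mean of the individual hazard ratios. A small self-contained illustration (made-up values, not GBSG output):

```r
# Hypothetical log-hazard contributions for four subjects
theta_0 <- c(-0.5, 0.2, 1.1, -0.3)   # under control
theta_1 <- c(-0.9, 0.6, 2.0, -0.8)   # under treatment

loghr_po <- theta_1 - theta_0         # individual causal log-HRs
AHR <- exp(mean(loghr_po))            # geometric mean of individual HRs
round(AHR, 4)                         # 1.1052

# The arithmetic mean of individual HRs is generally larger (Jensen's
# inequality), which is why a mean individual HR near 1 can coexist
# with an AHR well below 1 in the printed DGM summaries
round(mean(exp(loghr_po)), 4)         # 1.3071
```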
3.2 Alternative Hypothesis (Heterogeneous Treatment Effect)
Under the alternative hypothesis, we create a DGM where the treatment effect varies across patient subgroups:
# Create DGM with heterogeneous treatment effect
# HR in harm subgroup (H) will be > 1 (treatment harmful)
# HR in complement (H^c) will be < 1 (treatment beneficial)
dgm_alt <- create_gbsg_dgm(
model = "alt",
k_treat = 1.0,
k_inter = 2.0, # Interaction effect multiplier
k_z3 = 1.0,
z1_quantile = 0.25, # ER threshold at 25th percentile
n_super = 5000,
cens_type = "weibull",
seed = 8316951,
verbose = TRUE
)
# Examine the DGM (print method now shows both HR and AHR metrics)
print(dgm_alt)
GBSG-Based AFT Data Generating Mechanism (Aligned)
===================================================
Model type: alt
Super-population size: 5000
Effect Modifiers:
k_treat: 1
k_inter: 2
k_z3: 1
Hazard Ratios (Cox-based, stacked PO):
Overall (causal): 0.7331
Harm subgroup (H): 2.9651
Complement (Hc): 0.6612
Ratio HR(H)/HR(Hc): 4.4846
Average Hazard Ratios (from loghr_po):
AHR (overall): 0.7431
AHR_harm (H): 3.8687
AHR_no_harm (Hc): 0.5848
Ratio AHR(H)/AHR(Hc): 6.6157
Subgroup definition: z1 == 1 & z3 == 1 (low ER & premenopausal)
ER threshold: 8 (quantile = 0.25)
Subgroup size: 634 (12.7%)
Analysis variables: v1, v2, v3, v4, v5, v6, v7
3.2.1 Accessing Hazard Ratios (New Aligned Format)
# Traditional access (backward compatible)
cat("Cox-based HRs:\n")
Cox-based HRs:
cat(" HR(H):", round(dgm_alt$hr_H_true, 4), "\n")
 HR(H): 2.9651
cat(" HR(Hc):", round(dgm_alt$hr_Hc_true, 4), "\n")
 HR(Hc): 0.6612
cat(" HR(overall):", round(dgm_alt$hr_causal, 4), "\n")
 HR(overall): 0.7331
# New AHR metrics (aligned with generate_aft_dgm_flex)
cat("\nAverage Hazard Ratios (from loghr_po):\n")
Average Hazard Ratios (from loghr_po):
cat(" AHR(H):", round(dgm_alt$AHR_H_true, 4), "\n")
 AHR(H): 3.8687
cat(" AHR(Hc):", round(dgm_alt$AHR_Hc_true, 4), "\n")
 AHR(Hc): 0.5848
cat(" AHR(overall):", round(dgm_alt$AHR, 4), "\n")
 AHR(overall): 0.7431
# Using hazard_ratios list (unified access)
cat("\nVia hazard_ratios list:\n")
Via hazard_ratios list:
cat(" harm_subgroup:", round(dgm_alt$hazard_ratios$harm_subgroup, 4), "\n")
 harm_subgroup: 2.9651
cat(" AHR_harm:", round(dgm_alt$hazard_ratios$AHR_harm, 4), "\n")
 AHR_harm: 3.8687
3.2.2 Examining Individual-Level Treatment Effects
# The super-population now includes individual log hazard ratios
df_super <- dgm_alt$df_super_rand
cat("Individual-level potential outcomes:\n")
Individual-level potential outcomes:
cat(" theta_0 (control log-hazard) range:",
    round(range(df_super$theta_0), 3), "\n")
 theta_0 (control log-hazard) range: -0.891 1.76
cat(" theta_1 (treatment log-hazard) range:",
    round(range(df_super$theta_1), 3), "\n")
 theta_1 (treatment log-hazard) range: -1.427 2.909
cat(" loghr_po (individual log-HR) range:",
    round(range(df_super$loghr_po), 3), "\n")
 loghr_po (individual log-HR) range: -0.537 1.353
# Verify AHR calculation
ahr_manual <- exp(mean(df_super$loghr_po))
cat("\nAHR verification:\n")
AHR verification:
cat(" exp(mean(loghr_po)) =", round(ahr_manual, 4), "\n")
 exp(mean(loghr_po)) = 0.7431
cat(" dgm$AHR =", round(dgm_alt$AHR, 4), "\n")
 dgm$AHR = 0.7431
# Distribution of individual treatment effects
cat("\nIndividual HR distribution:\n")
Individual HR distribution:
individual_hr <- exp(df_super$loghr_po)
cat(" Mean:", round(mean(individual_hr), 4), "\n")
 Mean: 1.0012
cat(" Median:", round(median(individual_hr), 4), "\n")
 Median: 0.5848
cat(" SD:", round(sd(individual_hr), 4), "\n")
 SD: 1.0928
3.3 Calibrating for a Target Hazard Ratio
Often, you want to specify the exact hazard ratio in the harm subgroup. Use calibrate_k_inter() to find the interaction parameter that achieves this.
3.3.1 Calibrate to Cox-based HR (Default)
# Find k_inter for Cox-based HR = 2.0 in harm subgroup
k_inter_cox <- calibrate_k_inter(
target_hr_harm = 2.0,
model = "alt",
k_treat = 1.0,
cens_type = "weibull",
use_ahr = FALSE, # Default: calibrate to Cox-based HR
verbose = TRUE
)
# Create DGM with calibrated k_inter
dgm_calibrated_cox <- create_gbsg_dgm(
model = "alt",
k_treat = 1.0,
k_inter = k_inter_cox,
verbose = TRUE
)
cat("\nVerification (Cox-based):\n")
Verification (Cox-based):
cat("Achieved HR(H):", round(dgm_calibrated_cox$hr_H_true, 3), "\n")
Achieved HR(H): 2
cat("HR(H^c):", round(dgm_calibrated_cox$hr_Hc_true, 3), "\n")
HR(H^c): 0.661
cat("Overall HR:", round(dgm_calibrated_cox$hr_causal, 3), "\n")
Overall HR: 0.722
3.3.2 Calibrate to AHR (New Option)
# Alternatively, calibrate to Average Hazard Ratio
k_inter_ahr <- calibrate_k_inter(
target_hr_harm = 2.0,
model = "alt",
k_treat = 1.0,
cens_type = "weibull",
use_ahr = TRUE, # NEW: calibrate to AHR instead
verbose = TRUE
)
# Create DGM with AHR-calibrated k_inter
dgm_calibrated_ahr <- create_gbsg_dgm(
model = "alt",
k_treat = 1.0,
k_inter = k_inter_ahr,
verbose = TRUE
)
cat("\nVerification (AHR-based):\n")
Verification (AHR-based):
cat("Achieved AHR(H):", round(dgm_calibrated_ahr$AHR_H_true, 3), "\n")
Achieved AHR(H): 2
cat("AHR(H^c):", round(dgm_calibrated_ahr$AHR_Hc_true, 3), "\n")
AHR(H^c): 0.585
cat("Overall AHR:", round(dgm_calibrated_ahr$AHR, 3), "\n")
Overall AHR: 0.683
3.3.3 Compare Cox HR vs AHR Calibration
# Compare the two calibration approaches
cat("Comparison of calibration methods:\n")
Comparison of calibration methods:
cat(sprintf("%-20s %-12s %-12s\n", "Metric", "Cox-calib", "AHR-calib"))
Metric               Cox-calib    AHR-calib
cat(sprintf("%-20s %-12.4f %-12.4f\n", "k_inter", k_inter_cox, k_inter_ahr))
k_inter              1.4947       1.3016
cat(sprintf("%-20s %-12.4f %-12.4f\n", "HR(H)",
            dgm_calibrated_cox$hr_H_true, dgm_calibrated_ahr$hr_H_true))
HR(H)                2.0000       1.7233
cat(sprintf("%-20s %-12.4f %-12.4f\n", "AHR(H)",
            dgm_calibrated_cox$AHR_H_true, dgm_calibrated_ahr$AHR_H_true))
AHR(H)               2.4001       2.0000
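Conceptually, calibration is one-dimensional root finding: locate the k_inter at which the achieved HR in H equals the target. A generic sketch using uniroot on a toy monotone curve, where hr_of_k is a stand-in for rebuilding the DGM at a given k and reading off hr_H_true (the curve's coefficients are made up for illustration):

```r
# Toy stand-in: pretend HR(H) grows log-linearly in k_inter.
# In practice this would call create_gbsg_dgm(k_inter = k)$hr_H_true.
hr_of_k <- function(k) exp(-0.41 + 0.6 * k)

target <- 2.0
sol <- uniroot(function(k) hr_of_k(k) - target,
               interval = c(-2, 5), tol = 1e-9)
sol$root              # ~1.8386 on this toy curve
hr_of_k(sol$root)     # ~2.0, the target HR(H)
```

Calibrating to AHR instead (use_ahr = TRUE) amounts to swapping which curve the root is found on.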
3.4 Validating k_inter Effect on Heterogeneity
Use validate_k_inter_effect() to verify the interaction parameter properly modulates treatment effect heterogeneity:
# Test k_inter effect on HR heterogeneity
# k_inter = 0 should give ratio ≈ 1 (no heterogeneity)
validation_results <- validate_k_inter_effect(
k_inter_values = c(-2, -1, 0, 1, 2, 3),
verbose = TRUE
)
Testing k_inter effect on HR heterogeneity...
k_inter HR(H) HR(Hc) AHR(H) AHR(Hc) Ratio(Cox) Ratio(AHR)
----------------------------------------------------------------------
-2.0 0.1336 0.6612 0.0884 0.5848 0.2021 0.1512
-1.0 0.3033 0.6612 0.2274 0.5848 0.4587 0.3888
0.0 0.6552 0.6612 0.5848 0.5848 0.9909 1.0000
1.0 1.3873 0.6612 1.5041 0.5848 2.0982 2.5721
2.0 2.9651 0.6612 3.8687 0.5848 4.4846 6.6157
3.0 6.6375 0.6612 9.9507 0.5848 10.0387 17.0162
PASS: k_inter = 0 gives Cox ratio ~= 1 (no heterogeneity)
PASS: k_inter = 0 gives AHR ratio ~= 1 (no heterogeneity)
AHR vs Cox HR alignment:
k_inter = -2.0: HR(H) vs AHR(H) diff = 0.0452
k_inter = -1.0: HR(H) vs AHR(H) diff = 0.0759
k_inter = 0.0: HR(H) vs AHR(H) diff = 0.0704
k_inter = 1.0: HR(H) vs AHR(H) diff = 0.1168
k_inter = 2.0: HR(H) vs AHR(H) diff = 0.9036
k_inter = 3.0: HR(H) vs AHR(H) diff = 3.3132
3.5 Null Hypothesis (Uniform Treatment Effect)
For Type I error evaluation, create a DGM with uniform treatment effect:
# Create null DGM (no treatment effect heterogeneity)
dgm_null <- create_gbsg_dgm(
model = "null",
k_treat = 1.0,
verbose = TRUE
)
cat("\nNull hypothesis HRs:\n")
Null hypothesis HRs:
cat("Overall HR:", round(dgm_null$hr_causal, 3), "\n")
Overall HR: 0.722
cat("HR(H^c):", round(dgm_null$hr_Hc_true, 3), "\n")
HR(H^c): 0.722
cat("AHR(H^c):", round(dgm_null$AHR_Hc_true, 3), "\n")
AHR(H^c): 0.654
cat("AHR:", round(dgm_null$AHR, 3), "\n")
AHR: 0.654
4 Simulating Trial Data
4.1 Single Trial Simulation
Use simulate_from_gbsg_dgm() to generate a single simulated trial:
# Use the Cox-calibrated DGM for simulations
dgm_calibrated <- dgm_calibrated_cox
# Simulate a single trial
sim_data <- simulate_from_gbsg_dgm(
dgm = dgm_calibrated,
n = 700,
rand_ratio = 1, # 1:1 randomization
sim_id = 1,
max_follow = 84, # 84 months administrative censoring
muC_adj = log(1.5) # Censoring adjustment
)
# Examine the data
cat("Simulated trial:\n")
Simulated trial:
cat(" N =", nrow(sim_data), "\n")
 N = 700
cat(" Events =", sum(sim_data$event.sim),
    "(", round(100 * mean(sim_data$event.sim), 1), "%)\n")
 Events = 376 ( 53.7 %)
cat(" Harm subgroup size =", sum(sim_data$flag.harm),
    "(", round(100 * mean(sim_data$flag.harm), 1), "%)\n")
 Harm subgroup size = 86 ( 12.3 %)
# Quick survival analysis
fit_itt <- coxph(Surv(y.sim, event.sim) ~ treat, data = sim_data)
cat(" Estimated ITT HR =", round(exp(coef(fit_itt)), 3), "\n")
 Estimated ITT HR = 0.741
4.1.1 Examining Individual-Level Effects in Simulated Data
# The simulated data now includes loghr_po
if ("loghr_po" %in% names(sim_data)) {
cat("\nIndividual treatment effects in simulated trial:\n")
# Compute AHR in simulated data by subgroup
ahr_H_sim <- exp(mean(sim_data$loghr_po[sim_data$flag.harm == 1]))
ahr_Hc_sim <- exp(mean(sim_data$loghr_po[sim_data$flag.harm == 0]))
ahr_overall_sim <- exp(mean(sim_data$loghr_po))
cat(" AHR(H) in sim:", round(ahr_H_sim, 3), "\n")
cat(" AHR(Hc) in sim:", round(ahr_Hc_sim, 3), "\n")
cat(" AHR(overall) in sim:", round(ahr_overall_sim, 3), "\n")
} else {
cat("\nNote: loghr_po not available in simulated data\n")
}
Individual treatment effects in simulated trial:
AHR(H) in sim: 2.4
AHR(Hc) in sim: 0.585
AHR(overall) in sim: 0.696
4.2 Examining Covariate Structure
dfcount <- df_counting(
df = sim_data,
by.risk = 6,
tte.name = "y.sim",
event.name = "event.sim",
treat.name = "treat"
)
plot_weighted_km(dfcount, conf.int = TRUE, show.logrank = TRUE,
ymax = 1.05, xmed.fraction = 0.775, ymed.offset = 0.125)create_summary_table(
data = sim_data,
treat_var = "treat",
table_title = "Characteristics by Treatment Arm",
vars_continuous = c("z1", "z2", "size", "z3", "z4", "z5"),
vars_categorical = c("flag.harm", "grade3"),
font_size = 12
)

Characteristics by Treatment Arm

| Characteristic | | Control (n=350) | Treatment (n=350) | P-value1 | SMD2 |
|---|---|---|---|---|---|
| z1 | Mean (SD) | 0.3 (0.4) | 0.2 (0.4) | 0.729 | 0.03 |
| z2 | Mean (SD) | 2.4 (1.1) | 2.5 (1.1) | 0.157 | 0.11 |
| size | Mean (SD) | 29.2 (14.3) | 30.1 (17.0) | 0.461 | 0.06 |
| z3 | Mean (SD) | 0.4 (0.5) | 0.4 (0.5) | 0.249 | 0.09 |
| z4 | Mean (SD) | 2.6 (1.1) | 2.5 (1.1) | 0.428 | 0.06 |
| z5 | Mean (SD) | 2.4 (1.1) | 2.4 (1.1) | 0.563 | 0.04 |
| flag.harm | | | | 0.908 | 0.00 |
| 0 | | 306 (87.4%) | 308 (88.0%) | | |
| 1 | | 44 (12.6%) | 42 (12.0%) | | |
| grade3 | | | | 0.530 | 0.02 |
| 0 | | 273 (78.0%) | 265 (75.7%) | | |
| 1 | | 77 (22.0%) | 85 (24.3%) | | |

1 P-values: t-test for continuous, chi-square/Fisher's exact for categorical/binary variables
2 SMD = Standardized mean difference (Cohen's d for continuous, Cramer's V for categorical)
5 Running Simulation Studies
5.1 Setting Up Parallel Processing
For efficient simulation studies, use parallel processing:
# Configure parallel backend
n_workers <- min(parallel::detectCores() - 1, 120)
plan(multisession, workers = n_workers)
registerDoFuture()
cat("Using", n_workers, "parallel workers\n")
Using 13 parallel workers
5.2 Define Simulation Parameters
# Simulation settings
sim_config_alt <- list(
n_sims = 1000, # Number of simulations (use 500-1000 for final)
n_sample = 700, # Sample size per trial
max_follow = 84, # Maximum follow-up (months)
seed_base = 8316951,
muC_adj = log(1.5)
)
sim_config_null <- list(
n_sims = 5000, # More simulations for Type I error estimation
n_sample = 700, # Sample size per trial
max_follow = 84, # Maximum follow-up (months)
seed_base = 8316951,
muC_adj = log(1.5)
)
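The larger n_sims under H0 is driven by Monte Carlo precision: the standard error of an estimated rejection rate p_hat is sqrt(p(1 - p)/n_sims). A quick arithmetic check of what 5000 simulations buy, independent of the simulation code itself:

```r
# Monte Carlo standard error of a rejection-rate estimate
mc_se <- function(p, n_sims) sqrt(p * (1 - p) / n_sims)

# Targeting a true Type I error near 0.05:
mc_se(0.05, 1000)   # ~0.0069
mc_se(0.05, 5000)   # ~0.0031 (95% CI half-width about 0.006)

# Approximate 95% CI for an observed rate, e.g. 260 hits in 5000 sims
p_hat <- 260 / 5000
p_hat + c(-1, 1) * 1.96 * mc_se(p_hat, 5000)
```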
# ForestSearch parameters (now includes use_twostage)
fs_params <- list(
outcome.name = "y.sim",
event.name = "event.sim",
treat.name = "treat",
id.name = "id",
use_lasso = TRUE,
use_grf = TRUE,
hr.threshold = 1.25,
hr.consistency = 1.0,
pconsistency.threshold = 0.90,
fs.splits = 400,
n.min = 60,
d0.min = 12,
d1.min = 12,
maxk = 2,
by.risk = 12,
vi.grf.min = -0.2,
# NEW: Two-stage algorithm option
use_twostage = TRUE, # Set TRUE for faster exploratory analysis
twostage_args = list() # Optional tuning parameters
)
# Confounders for analysis
confounders_base <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")
5.2.1 Two-Stage Algorithm Option
The use_twostage parameter enables a faster two-stage search algorithm:
# Fast configuration with two-stage algorithm
fs_params_fast <- modifyList(fs_params, list(
use_twostage = TRUE,
twostage_args = list(
n.splits.screen = 30, # Stage 1 screening splits
batch.size = 20, # Stage 2 batch size
conf.level = 0.95 # Early stopping confidence
)
))
cat("Standard search: use_twostage =", fs_params$use_twostage, "\n")
Standard search: use_twostage = TRUE
cat("Fast search: use_twostage =", fs_params_fast$use_twostage, "\n")
Fast search: use_twostage = TRUE
5.3 Running Alternative Hypothesis Simulations
cat("Running", sim_config_alt$n_sims, "simulations under H1...\n")
Running 1000 simulations under H1...
start_time <- Sys.time()
results_alt <- foreach(
sim = 1:sim_config_alt$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config_alt$n_sample,
max_follow = sim_config_alt$max_follow,
muC_adj = sim_config_alt$muC_adj,
confounders_base = confounders_base,
cox_formula_adj = survival::Surv(y.sim, event.sim) ~ treat + z1 + z2 + z3,
n_add_noise = 0L,
run_fs = TRUE,
run_fs_grf = FALSE,
run_grf = TRUE,
fs_params = fs_params,
verbose = TRUE,
debug = FALSE,
verbose_n = 3 # Only print first 3 simulations
)
}
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 48.53742 2
leaf.node control.mean control.size control.se depth
1 2 -5.33 520.00 1.11 1
2 3 1.02 180.00 2.32 1
11 4 -6.71 426.00 1.21 2
21 5 3.44 142.00 2.62 2
4 7 -7.11 78.00 3.18 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 0.28638272
z2 .
z3 0.04573174
z4 -0.34834939
z5 0.45235142
size .
grade3 0.01493646
Cox-LASSO selected: z1 z3 z4 z5 grade3
Cox-LASSO not selected: z2 size
### End Lasso selection
## After lasso: z4 z5
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5)
Categorical after Lasso: z1 z3 grade3
Factors per GRF:
Initial GRF cuts included
Factors included per GRF (not in lasso)
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 10
Valid cuts: 10
Errors: 0
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 10
[1] "z4 <= 2.5" "z4 <= 3" "z4 <= 2" "z5 <= 2.4" "z5 <= 2" "z5 <= 1"
[7] "z5 <= 3" "z1" "z3" "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.02 minutes
--- Filtering Summary ---
Combinations evaluated: 210
Passed variance check: 189
Passed prevalence (>= 0.025 ): 189
Passed redundancy check: 178
Passed event counts (d0>= 12 , d1>= 12 ): 160
Passed sample size (n>= 60 ): 158
Cox model fit successfully: 158
Passed HR threshold (>= 1.25 ): 3
-------------------------
Found 3 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 3
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 1 near-duplicate subgroups
# of unique initial candidates: 2
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 2
# Early stop threshold: 0.95
Parallel config: workers = 6 , batch_size = 1
Batch 1 / 2 : candidates 1 - 1
Batch 2 / 2 : candidates 2 - 2
Evaluated 2 of 2 candidates (complete)
1 subgroups passed consistency threshold
SG focus = hr
Seconds and minutes forestsearch overall = 9.752 0.1625
Consistency algorithm used: twostage
tau, maxdepth = 48.53742 2
leaf.node control.mean control.size control.se depth
1 2 -5.33 520.00 1.11 1
2 3 1.02 180.00 2.32 1
11 4 -6.71 426.00 1.21 2
21 5 3.44 142.00 2.62 2
4 7 -7.11 78.00 3.18 2
GRF subgroup NOT found
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 49.03326 2
leaf.node control.mean control.size control.se depth
1 2 -5.56 534.00 1.18 1
2 3 2.39 166.00 2.32 1
3 4 -5.10 262.00 1.63 2
4 5 10.27 96.00 2.89 2
5 6 -7.05 336.00 1.52 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 5 10.27 96.00 2.89 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"
All splits (from all trees):
[1] "z1 <= 0" "z2 <= 2" "size <= 58"
GRF cuts identified: 3
Cuts: z1 <= 0, z2 <= 2, size <= 58
Selected tree depth: 2
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 .
z2 .
z3 .
z4 -0.2924157
z5 0.4106361
size .
grade3 .
Cox-LASSO selected: z4 z5
Cox-LASSO not selected: z1 z2 z3 size grade3
### End Lasso selection
## After lasso: z4 z5
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5)
Categorical after Lasso:
Factors per GRF: z1 <= 0 z2 <= 2 size <= 58
Initial GRF cuts included z1 <= 0 z2 <= 2 size <= 58
Factors included per GRF (not in lasso) z1 <= 0 z2 <= 2 size <= 58
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 10
Valid cuts: 10
Errors: 0
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 10
[1] "z2 <= 2" "size <= 58" "z4 <= 2.5" "z4 <= 3" "z4 <= 2"
[6] "z5 <= 2.5" "z5 <= 2" "z5 <= 1.8" "z5 <= 3" "z1 <= 0"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.01 minutes
--- Filtering Summary ---
Combinations evaluated: 210
Passed variance check: 189
Passed prevalence (>= 0.025 ): 189
Passed redundancy check: 174
Passed event counts (d0>= 12 , d1>= 12 ): 143
Passed sample size (n>= 60 ): 142
Cox model fit successfully: 142
Passed HR threshold (>= 1.25 ): 9
-------------------------
Found 9 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 9
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 3 near-duplicate subgroups
# of unique initial candidates: 6
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 6
# Early stop threshold: 0.95
Parallel config: workers = 6 , batch_size = 1
Batch 1 / 6 : candidates 1 - 1
==================================================
EARLY STOP TRIGGERED (batch 1 )
Candidate: 1 of 6
Pcons: 1 >= 0.95
==================================================
Evaluated 1 of 6 candidates (early stop)
1 subgroups passed consistency threshold
SG focus = hr
Seconds and minutes forestsearch overall = 3.815 0.0636
Consistency algorithm used: twostage
tau, maxdepth = 49.03326 2
leaf.node control.mean control.size control.se depth
1 2 -5.56 534.00 1.18 1
2 3 2.39 166.00 2.32 1
3 4 -5.10 262.00 1.63 2
4 5 10.27 96.00 2.89 2
5 6 -7.05 336.00 1.52 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 5 10.27 96.00 2.89 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"
All splits (from all trees):
[1] "z1 <= 0" "z2 <= 2" "size <= 58"
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 49.57713 2
leaf.node control.mean control.size control.se depth
1 2 -6.22 494.00 1.21 1
2 3 5.44 206.00 2.11 1
21 5 -6.09 123.00 2.07 2
3 6 -6.92 385.00 1.46 2
4 7 7.19 143.00 2.55 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 7 7.19 143.00 2.55 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"
All splits (from all trees):
[1] "z1 <= 0" "z5 <= 1" "size <= 18"
GRF cuts identified: 3
Cuts: z1 <= 0, z5 <= 1, size <= 18
Selected tree depth: 2
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 .
z2 .
z3 0.07273592
z4 -0.40237672
z5 0.35008925
size .
grade3 .
Cox-LASSO selected: z3 z4 z5
Cox-LASSO not selected: z1 z2 size grade3
### End Lasso selection
## After lasso: z4 z5
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5)
Categorical after Lasso: z3
Factors per GRF: z1 <= 0 z5 <= 1 size <= 18
Initial GRF cuts included z1 <= 0 z5 <= 1 size <= 18
Factors included per GRF (not in lasso) z1 <= 0 z5 <= 1 size <= 18
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 11 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 11
Valid cuts: 10
Errors: 0
Dropping variables (cut only has 1 level): z4 <= 4
Total cuts after dropping invalid: 10
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 10
[1] "z5 <= 1" "size <= 18" "z4 <= 2.5" "z4 <= 2" "z4 <= 1"
[6] "z5 <= 2.5" "z5 <= 2" "z5 <= 3" "z3" "z1 <= 0"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.02 minutes
--- Filtering Summary ---
Combinations evaluated: 210
Passed variance check: 189
Passed prevalence (>= 0.025 ): 189
Passed redundancy check: 178
Passed event counts (d0>= 12 , d1>= 12 ): 162
Passed sample size (n>= 60 ): 154
Cox model fit successfully: 154
Passed HR threshold (>= 1.25 ): 14
-------------------------
Found 14 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 14
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 4 near-duplicate subgroups
# of unique initial candidates: 10
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 10
# Early stop threshold: 0.95
Parallel config: workers = 6 , batch_size = 1
Batch 1 / 10 : candidates 1 - 1
==================================================
EARLY STOP TRIGGERED (batch 1 )
Candidate: 1 of 10
Pcons: 1 >= 0.95
==================================================
Evaluated 1 of 10 candidates (early stop)
1 subgroups passed consistency threshold
SG focus = hr
Seconds and minutes forestsearch overall = 4.801 0.08
Consistency algorithm used: twostage
tau, maxdepth = 49.57713 2
leaf.node control.mean control.size control.se depth
1 2 -6.22 494.00 1.21 1
2 3 5.44 206.00 2.11 1
21 5 -6.09 123.00 2.07 2
3 6 -6.92 385.00 1.46 2
4 7 7.19 143.00 2.55 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 7 7.19 143.00 2.55 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"
All splits (from all trees):
[1] "z1 <= 0" "z5 <= 1" "size <= 18"
runtime_alt <- difftime(Sys.time(), start_time, units = "mins")
cat("Completed in", round(runtime_alt, 1), "minutes\n")
Completed in 10 minutes
cat("Results:", nrow(results_alt), "rows\n")
Results: 2000 rows
5.4 Running Null Hypothesis Simulations
cat("Running", sim_config_null$n_sims, "simulations under H0...\n")
Running 5000 simulations under H0...
start_time <- Sys.time()
results_null <- foreach(
sim = 1:sim_config_null$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_null,
n_sample = sim_config_null$n_sample,
max_follow = sim_config_null$max_follow,
muC_adj = sim_config_null$muC_adj,
confounders_base = confounders_base,
cox_formula_adj = survival::Surv(y.sim, event.sim) ~ treat + z1 + z2 + z3,
n_add_noise = 0L,
run_fs = TRUE,
run_fs_grf = FALSE,
run_grf = TRUE,
fs_params = fs_params,
verbose = TRUE,
verbose_n = 3 # Only print first 3 simulations
)
}
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 47.91247 2
leaf.node control.mean control.size control.se depth
2 3 -4.78 695.00 1.00 1
1 4 -5.09 568.00 1.10 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 0.0078483334
z2 0.0324623135
z3 .
z4 -0.3338944143
z5 0.4663620842
size 0.0002302206
grade3 .
Cox-LASSO selected: z1 z2 z4 z5 size
Cox-LASSO not selected: z3 grade3
### End Lasso selection
## After lasso: z2 z4 z5 size
Default cuts included from Lasso: z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size)
Categorical after Lasso: z1
Factors per GRF:
Initial GRF cuts included
Factors included per GRF (not in lasso)
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 15 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 15
Valid cuts: 14
Errors: 0
Dropping variables (cut only has 1 level): z2 <= 4
Total cuts after dropping invalid: 14
✓ All 14 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 14
[1] "z2 <= 2.5" "z2 <= 2" "z4 <= 2.5" "z4 <= 3" "z4 <= 2"
[6] "z5 <= 2.4" "z5 <= 2" "z5 <= 1" "z5 <= 3" "size <= 29.1"
[11] "size <= 25" "size <= 20" "size <= 35" "z1"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 406
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.05 minutes
--- Filtering Summary ---
Combinations evaluated: 406
Passed variance check: 373
Passed prevalence (>= 0.025 ): 373
Passed redundancy check: 354
Passed event counts (d0>= 12 , d1>= 12 ): 327
Passed sample size (n>= 60 ): 316
Cox model fit successfully: 316
Passed HR threshold (>= 1.25 ): 4
-------------------------
Found 4 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 4
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 2 near-duplicate subgroups
# of unique initial candidates: 2
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 2
# Early stop threshold: 0.95
Parallel config: workers = 6 , batch_size = 1
Batch 1 / 2 : candidates 1 - 1
Batch 2 / 2 : candidates 2 - 2
Evaluated 2 of 2 candidates (complete)
No subgroups found meeting consistency threshold
Seconds and minutes forestsearch overall = 12.158 0.2026
Consistency algorithm used: twostage
tau, maxdepth = 47.91247 2
leaf.node control.mean control.size control.se depth
2 3 -4.78 695.00 1.00 1
1 4 -5.09 568.00 1.10 2
GRF subgroup NOT found
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 49.50766 2
leaf.node control.mean control.size control.se depth
1 2 -4.62 689.00 1.07 1
2 5 -6.72 124.00 2.32 2
3 6 -5.20 513.00 1.27 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 .
z2 .
z3 -0.1081342
z4 -0.2654241
z5 0.4210064
size .
grade3 .
Cox-LASSO selected: z3 z4 z5
Cox-LASSO not selected: z1 z2 size grade3
### End Lasso selection
## After lasso: z4 z5
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5)
Categorical after Lasso: z3
Factors per GRF:
Initial GRF cuts included
Factors included per GRF (not in lasso)
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 8 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 8
Valid cuts: 8
Errors: 0
✓ All 8 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 8
[1] "z4 <= 2.5" "z4 <= 3" "z4 <= 2" "z5 <= 2.5" "z5 <= 2" "z5 <= 1.8"
[7] "z5 <= 3" "z3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 136
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.03 minutes
--- Filtering Summary ---
Combinations evaluated: 136
Passed variance check: 117
Passed prevalence (>= 0.025 ): 117
Passed redundancy check: 106
Passed event counts (d0>= 12 , d1>= 12 ): 98
Passed sample size (n>= 60 ): 98
Cox model fit successfully: 98
Passed HR threshold (>= 1.25 ): 0
-------------------------
NO subgroup candidates found
tau, maxdepth = 49.50766 2
leaf.node control.mean control.size control.se depth
1 2 -4.62 689.00 1.07 1
2 5 -6.72 124.00 2.32 2
3 6 -5.20 513.00 1.27 2
GRF subgroup NOT found
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 46.7094 2
leaf.node control.mean control.size control.se depth
1 2 2.55 101.00 2.51 1
2 3 -5.13 599.00 1.05 1
3 4 3.54 91.00 2.44 2
4 5 -6.03 449.00 1.13 2
5 6 -7.26 105.00 2.90 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
## Prior to lasso: z2 z4 z5 size
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
z1 -0.2046970204
z2 .
z3 -0.0162003130
z4 -0.3904744742
z5 0.3456587351
size .
grade3 0.0002404273
Cox-LASSO selected: z1 z3 z4 z5 grade3
Cox-LASSO not selected: z2 size
### End Lasso selection
## After lasso: z4 z5
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5)
Categorical after Lasso: z1 z3 grade3
Factors per GRF:
Initial GRF cuts included
Factors included per GRF (not in lasso)
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 10
Valid cuts: 9
Errors: 0
Dropping variables (cut only has 1 level): z4 <= 4
Total cuts after dropping invalid: 9
✓ All 9 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 9
[1] "z4 <= 2.5" "z4 <= 2" "z4 <= 1" "z5 <= 2.5" "z5 <= 2" "z5 <= 3"
[7] "z1" "z3" "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 171
Events criteria: control >= 12 , treatment >= 12
Sample size criteria: n >= 60
Subgroup search completed in 0.01 minutes
--- Filtering Summary ---
Combinations evaluated: 171
Passed variance check: 154
Passed prevalence (>= 0.025 ): 154
Passed redundancy check: 146
Passed event counts (d0>= 12 , d1>= 12 ): 141
Passed sample size (n>= 60 ): 135
Cox model fit successfully: 135
Passed HR threshold (>= 1.25 ): 0
-------------------------
NO subgroup candidates found
tau, maxdepth = 46.7094 2
leaf.node control.mean control.size control.se depth
1 2 2.55 101.00 2.51 1
2 3 -5.13 599.00 1.05 1
3 4 3.54 91.00 2.44 2
4 5 -6.03 449.00 1.13 2
5 6 -7.26 105.00 2.90 2
GRF subgroup NOT found
runtime_null <- difftime(Sys.time(), start_time, units = "mins")
cat("Completed in", round(runtime_null, 1), "minutes\n")
Completed in 41.2 minutes
6 Summarizing Results
6.1 Operating Characteristics Summary
# Summarize alternative hypothesis results
summary_alt <- summarize_simulation_results(results_alt)
print(summary_alt)
                 FS     GRF
any.H 0.880 0.750
sens 0.850 0.890
spec 0.980 0.970
ppv 0.890 0.840
npv 0.980 0.980
Avg(#H) 85.000 98.000
minH 61.000 60.000
maxH 226.000 224.000
Avg(#Hc) 625.000 626.000
minHc 474.000 476.000
maxHc 700.000 700.000
hat(H*) 2.275 NaN
hat(hat[H]) 2.316 2.118
hat(Hc*) 0.636 NaN
hat(hat[Hc]) 0.642 0.632
hat(H*)all 2.275 NaN
hat(Hc*)all 0.636 NaN
hat(ITT) 0.744 0.744
hat(ITTadj) 0.761 0.761
# Summarize null hypothesis results
summary_null <- summarize_simulation_results(results_null)
print(summary_null)
                 FS     GRF
any.H 0.060 0.060
sens NaN NaN
spec 0.860 0.890
ppv 0.000 0.000
npv 1.000 1.000
Avg(#H) 98.000 78.000
minH 61.000 60.000
maxH 386.000 177.000
Avg(#Hc) 694.000 695.000
minHc 314.000 523.000
maxHc 700.000 700.000
hat(H*) NaN NaN
hat(hat[H]) 1.804 1.524
hat(Hc*) 0.766 NaN
hat(hat[Hc]) 0.675 0.666
hat(H*)all NaN NaN
hat(Hc*)all 0.766 NaN
hat(ITT) 0.698 0.698
hat(ITTadj) 0.691 0.691
7 Theoretical Subgroup Detection Rate Approximation
The function compute_detection_probability() provides an analytical approximation based on asymptotic normal theory:
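The exact criterion integrates over both the HR threshold and the consistency requirement; as a rough single-threshold sketch (our simplification, not the package's exact formula), the approximation follows from treating the subgroup log-HR estimate as approximately normal with a Schoenfeld-type variance of 4 divided by the expected event count:

```latex
% Sketch only: single-threshold normal approximation (the forestsearch
% criterion additionally involves hr.consistency).
% d = expected events in the subgroup, \theta = true subgroup HR.
\[
d \approx n_{\mathrm{sg}}\,(1 - p_{\mathrm{cens}}), \qquad
\log\widehat{\mathrm{HR}}_H \;\dot\sim\; N\!\left(\log\theta,\; \tfrac{4}{d}\right)
\]
\[
P(\mathrm{detect}) \;\approx\; 1 - \Phi\!\left(
  \frac{\log(\texttt{hr.threshold}) - \log\theta}{2/\sqrt{d}}
\right)
\]
```

Plugging in the values below (n_sg = 89, 45.3% censoring, threshold 1.25, true HR 2) this single-threshold bound comes out somewhat above the reported theoretical value, as expected, since the consistency check can only lower the detection probability.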
#| label: theoretical-power
#| fig-width: 8
#| fig-height: 6
# =============================================================================
# Theoretical Detection Probability Analysis
# =============================================================================
# Calculate expected subgroup characteristics
n_sg_expected <- sim_config_alt$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm)
prop_cens <- mean(results_alt$p.cens) # Censoring proportion
cat("=== Subgroup Characteristics ===\n")
=== Subgroup Characteristics ===
cat("Expected subgroup size (n_sg):", round(n_sg_expected), "\n")
Expected subgroup size (n_sg): 89
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.453
cat("True HR in H:", round(dgm_calibrated$hr_H_true, 3), "\n")
True HR in H: 2
cat("HR threshold:", fs_params$hr.threshold, "\n")
HR threshold: 1.25
# -----------------------------------------------------------------------------
# Single-Point Detection Probability
# -----------------------------------------------------------------------------
# True H is dgm_calibrated$hr_H_true
# However we want at plim of observed estimate
#plim_hr_hatH <- c(summary_alt[c("hat(hat[H])"),1])
dgm_calibrated$hr_H_true
   treat
1.999999
# Compute detection probability at the true HR
prob_detect <- compute_detection_probability(
theta = dgm_calibrated$hr_H_true,
n_sg = round(n_sg_expected),
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
method = "cubature"
)
# Compare theoretical to empirical (alternative)
cat("\n=== Detection Probability Comparison ===\n")
=== Detection Probability Comparison ===
cat("Theoretical FS (asymptotic):", round(prob_detect, 3), "\n")
Theoretical FS (asymptotic): 0.899
cat("Empirical FS:", round(mean(results_alt[analysis == "FS"]$any.H), 3), "\n")
Empirical FS: 0.88
cat("Empirical FSlg:", round(mean(results_alt[analysis == "FSlg"]$any.H), 3), "\n")
Empirical FSlg: NaN
if ("GRF" %in% results_alt$analysis) {
cat("Empirical GRF:", round(mean(results_alt[analysis == "GRF"]$any.H), 3), "\n")
}
Empirical GRF: 0.754
# Null
#plim_hr_itt <- c(summary_alt[c("hat(ITT)all"),1])
# Calculate at min SG size
# Compute detection probability at the overall causal HR under the null
prob_detect_null <- compute_detection_probability(
theta = dgm_null$hr_causal,
n_sg = fs_params$n.min,
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
method = "cubature"
)
# Compare theoretical to empirical (null)
cat("\n=== Detection Probability Comparison ===\n")
=== Detection Probability Comparison ===
cat("Under the null calculate at min SG size:", fs_params$n.min, "\n")
Under the null calculate at min SG size: 60
cat("Theoretical FS at min(SG) (asymptotic):", round(prob_detect_null, 6), "\n")
Theoretical FS at min(SG) (asymptotic): 0.039618
cat("Empirical FS:", round(mean(results_null[analysis == "FS"]$any.H), 6), "\n")
Empirical FS: 0.0596
cat("Empirical FSlg:", round(mean(results_null[analysis == "FSlg"]$any.H), 6), "\n")
Empirical FSlg: NaN
if ("GRF" %in% results_null$analysis) {
cat("Empirical GRF:", round(mean(results_null[analysis == "GRF"]$any.H), 6), "\n")
}
Empirical GRF: 0.0648
prop_cens <- mean(results_null$p.cens) # Censoring proportion
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.463
# -----------------------------------------------------------------------------
# Generate Full Detection Curve
# -----------------------------------------------------------------------------
# Generate detection probability curve across HR values
detection_curve <- generate_detection_curve(
theta_range = c(0.5, 3.0),
n_points = 50,
n_sg = round(n_sg_expected),
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
include_reference = TRUE,
verbose = FALSE
)
# -----------------------------------------------------------------------------
# Visualization
# -----------------------------------------------------------------------------
# Plot detection curve with empirical overlay
plot_detection_curve(
detection_curve,
add_reference_lines = TRUE,
add_threshold_line = TRUE,
title = sprintf(
"Detection Probability Curve (n=%d, cens=%.0f%%, threshold=%.2f)",
round(n_sg_expected), 100 * prop_cens, fs_params$hr.threshold
)
)
# Add empirical results as points
empirical_rates <- c(
FS = mean(results_alt[analysis == "FS"]$any.H),
FSlg = mean(results_alt[analysis == "FSlg"]$any.H)
)
if ("GRF" %in% results_alt$analysis) {
empirical_rates["GRF"] <- mean(results_alt[analysis == "GRF"]$any.H)
}
# Mark the true HR and empirical detection rates
points(
x = rep(dgm_calibrated$hr_H_true, length(empirical_rates)),
y = empirical_rates,
pch = c(16, 17, 18)[1:length(empirical_rates)],
col = c("blue", "darkgreen", "purple")[1:length(empirical_rates)],
cex = 1.5
)
# Add vertical line at true HR
abline(v = dgm_calibrated$hr_H_true, lty = 2, col = "blue", lwd = 1)
# Legend for empirical points
legend(
"topleft",
legend = c(
sprintf("H true = %.2f", dgm_calibrated$hr_H_true),
paste(names(empirical_rates), "=", round(empirical_rates, 3))
),
pch = c(NA, 16, 17, 18)[1:(length(empirical_rates) + 1)],
lty = c(2, rep(NA, length(empirical_rates))),
col = c("blue", "blue", "darkgreen", "purple")[1:(length(empirical_rates) + 1)],
cex = 0.8,
bty = "n"
)
7.1 AHR Metrics in Results (New)
The aligned analysis functions now compute AHR estimates in addition to Cox-based HRs:
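As a toy numerical illustration of a potential-outcome-based HR summary (hypothetical data and a simple geometric-mean summary of per-subject log-HRs — not the package's AHR estimator, which is defined over the at-risk distribution):

```r
# Toy sketch: summarize per-subject potential-outcome log hazard ratios
# (analogous to a loghr_po column) by a geometric-mean hazard ratio.
# All numbers here are simulated for illustration only.
set.seed(1)
n <- 1000
# 15% of subjects are harmed (HR = 2); the rest benefit (HR = 0.7)
harm <- rbinom(n, 1, 0.15)
loghr_po <- ifelse(harm == 1, log(2), log(0.7))
# Geometric-mean HR: exponentiate the average per-subject log-HR
geo_mean_hr <- exp(mean(loghr_po))
round(geo_mean_hr, 3)
```

Averaging on the log scale (rather than averaging the HRs themselves) keeps the summary symmetric between harm and benefit; here it lands below 1 because the benefiting majority dominates.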
# Check for AHR columns in results
ahr_cols <- grep("ahr", names(results_alt), value = TRUE)
cat("AHR columns in results:", paste(ahr_cols, collapse = ", "), "\n\n")
AHR columns in results:
if (length(ahr_cols) > 0) {
# Summarize AHR estimates
results_found <- results_alt[results_alt$any.H == 1, ]
if (nrow(results_found) > 0 && "ahr.H.hat" %in% names(results_found)) {
cat("AHR estimates (when subgroup found):\n")
cat(" Mean AHR(H) estimated:", round(mean(results_found$ahr.H.hat, na.rm = TRUE), 3), "\n")
cat(" Mean AHR(Hc) estimated:", round(mean(results_found$ahr.Hc.hat, na.rm = TRUE), 3), "\n")
cat(" True AHR(H):", round(dgm_calibrated$AHR_H_true, 3), "\n")
cat(" True AHR(Hc):", round(dgm_calibrated$AHR_Hc_true, 3), "\n")
}
}
7.2 Formatted Tables
# Format operating characteristics for H1
format_oc_results(
results = results_alt,
title = "Operating Characteristics (Alternative Hypothesis)",
subtitle = sprintf("n = %d, %d simulations, HR(H) = %.2f",
sim_config_alt$n_sample,
sim_config_alt$n_sims,
dgm_calibrated$hr_H_true),
use_gt = TRUE
)

Operating Characteristics (Alternative Hypothesis)
n = 700, 1000 simulations, HR(H) = 2.00

| Method | N Sims | Detection | Sen | Spec | PPV | NPV | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FS | 1000 | 0.880 | 0.853 | 0.984 | 0.893 | 0.979 | 2.316 | 0.642 | 2.275 | 0.636 | 0.744 | 85 | 61 | 226 |
| GRF | 1000 | 0.754 | 0.892 | 0.970 | 0.840 | 0.984 | 2.118 | 0.632 | - | - | 0.744 | 98 | 60 | 224 |
# Format operating characteristics for H0
format_oc_results(
results = results_null,
title = "Type I Error (Null Hypothesis)",
subtitle = sprintf("n = %d, %d simulations, HR(overall) = %.2f",
sim_config_null$n_sample,
sim_config_null$n_sims,
dgm_null$hr_causal),
use_gt = TRUE
)

Type I Error (Null Hypothesis)
n = 700, 5000 simulations, HR(overall) = 0.72

| Method | N Sims | Detection | Sen | Spec | PPV | NPV | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FS | 5000 | 0.060 | - | 0.860 | 0.000 | 1.000 | 1.804 | 0.675 | - | 0.766 | 0.698 | 98 | 61 | 386 |
| GRF | 5000 | 0.065 | - | 0.888 | 0.000 | 1.000 | 1.524 | 0.666 | - | - | 0.698 | 78 | 60 | 177 |
7.3 Key Metrics
# Extract key metrics
cat("=== KEY OPERATING CHARACTERISTICS ===\n\n")
=== KEY OPERATING CHARACTERISTICS ===
cat("Alternative Hypothesis (H1):\n")
Alternative Hypothesis (H1):
# Use a loop name other than "analysis": inside data.table's `[` a variable
# named "analysis" would resolve to the column of the same name, so the
# subset would keep all rows and print pooled values for every method.
for (meth in unique(results_alt$analysis)) {
  res <- results_alt[results_alt$analysis == meth, ]
  cat(sprintf("  %s: Power = %.3f, Sens = %.3f, Spec = %.3f, PPV = %.3f\n",
              meth,
              mean(res$any.H),
              mean(res$sens, na.rm = TRUE),
              mean(res$spec, na.rm = TRUE),
              mean(res$ppv, na.rm = TRUE)))
}
  FS: Power = 0.817, Sens = 0.871, Spec = 0.978, PPV = 0.869
  GRF: Power = 0.817, Sens = 0.871, Spec = 0.978, PPV = 0.869
cat("\nNull Hypothesis (H0):\n")
Null Hypothesis (H0):
for (meth in unique(results_null$analysis)) {
  res <- results_null[results_null$analysis == meth, ]
  cat(sprintf("  %s: Type I Error = %.4f\n",
              meth,
              mean(res$any.H)))
}
  FS: Type I Error = 0.0622
  GRF: Type I Error = 0.0622
8 Advanced Topics
8.1 Comparing Standard vs Two-Stage Algorithm
# Run simulations with two-stage algorithm for comparison
results_twostage <- foreach(
sim = 1:100,
.combine = rbind,
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config_alt$n_sample,
confounders_base = confounders_base,
run_fs = TRUE,
run_fs_grf = FALSE,
run_grf = FALSE,
fs_params = fs_params_fast, # use_twostage = TRUE
verbose = FALSE
)
}
# Compare detection rates
cat("Standard algorithm power:", round(mean(results_alt$any.H[results_alt$analysis == "FS"]), 3), "\n")
cat("Two-stage algorithm power:", round(mean(results_twostage$any.H), 3), "\n")
8.2 Adding Noise Variables
Test ForestSearch robustness by including irrelevant noise variables:
# Run simulations with noise variables
results_noise <- foreach(
sim = 1:sim_config_alt$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config_alt$n_sample,
confounders_base = confounders_base,
n_add_noise = 10, # Add 10 noise variables
run_fs = TRUE,
fs_params = fs_params,
verbose = FALSE
)
}
# Compare detection rates
cat("Without noise:", round(mean(results_alt$any.H), 3), "\n")
cat("With 10 noise vars:", round(mean(results_noise$any.H), 3), "\n")
8.3 Sensitivity Analysis: Varying Parameters
# Test different HR thresholds
thresholds <- c(1.10, 1.25, 1.50)
results_by_thresh <- list()
for (thresh in thresholds) {
results_by_thresh[[as.character(thresh)]] <- foreach(
sim = 1:100,
.combine = rbind,
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config_alt$n_sample,
confounders_base = confounders_base,
run_fs = TRUE,
fs_params = modifyList(fs_params, list(hr.threshold = thresh)),
verbose = FALSE
)
}
results_by_thresh[[as.character(thresh)]]$threshold <- thresh
}
# Combine and summarize
combined <- rbindlist(results_by_thresh)
combined[, .(power = mean(any.H), ppv = mean(ppv, na.rm = TRUE)),
         by = .(threshold, analysis)]
9 Saving Results
# Save simulation results for later use
save_simulation_results(
results = results_alt,
dgm = dgm_calibrated,
summary_table = summary_alt,
runtime_hours = as.numeric(runtime_alt) / 60,
output_file = "forestsearch_simulation_alt.Rdata",
# Include AHR metrics in saved output
ahr_metrics = list(
AHR_H_true = dgm_calibrated$AHR_H_true,
AHR_Hc_true = dgm_calibrated$AHR_Hc_true,
AHR = dgm_calibrated$AHR
)
)
save_simulation_results(
results = results_null,
dgm = dgm_null,
summary_table = summary_null,
runtime_hours = as.numeric(runtime_null) / 60,
output_file = "forestsearch_simulation_null.Rdata"
)
10 Complete Example Script
Here’s a minimal self-contained script for running a simulation study:
# ===========================================================================
# Complete ForestSearch Simulation Study - Minimal Example (Aligned)
# ===========================================================================
library(forestsearch)
library(data.table)
library(survival)
library(foreach)
library(doFuture)
# --- Configuration ---
N_SIMS <- 500
N_SAMPLE <- 500
TARGET_HR_HARM <- 1.5
# --- Setup parallel processing ---
plan(multisession, workers = 4)
registerDoFuture()
# --- Create DGM ---
# Option 1: Calibrate to Cox-based HR
k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM,
use_ahr = FALSE, verbose = TRUE)
# Option 2: Calibrate to AHR instead
# k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM,
# use_ahr = TRUE, verbose = TRUE)
dgm <- create_gbsg_dgm(model = "alt", k_inter = k_inter, verbose = TRUE)
# Verify hazard ratios (new aligned output)
cat("\nDGM Summary:\n")
cat(" Cox HR(H):", round(dgm$hr_H_true, 3), "\n")
cat(" AHR(H):", round(dgm$AHR_H_true, 3), "\n")
cat(" Cox HR(Hc):", round(dgm$hr_Hc_true, 3), "\n")
cat(" AHR(Hc):", round(dgm$AHR_Hc_true, 3), "\n")
# --- Run simulations ---
confounders <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")
results <- foreach(
sim = 1:N_SIMS,
.combine = rbind,
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm,
n_sample = N_SAMPLE,
max_follow = 60,
confounders_base = confounders,
run_fs = TRUE,
run_fs_grf = TRUE,
run_grf = FALSE,
fs_params = list(
hr.threshold = 1.25,
fs.splits = 300,
maxk = 2,
use_twostage = FALSE # Set TRUE for faster analysis
)
)
}
# --- Summarize ---
summary_table <- summarize_simulation_results(results)
print(summary_table)
# --- Display formatted table ---
format_oc_results(results = results, title = sprintf("Operating Characteristics (n=%d)", N_SAMPLE))
# --- Report AHR metrics (new) ---
results_found <- results[results$any.H == 1, ]
if (nrow(results_found) > 0 && "ahr.H.hat" %in% names(results_found)) {
cat("\nAHR Estimates:\n")
cat(" True AHR(H):", round(dgm$AHR_H_true, 3), "\n")
cat(" Mean estimated AHR(H):", round(mean(results_found$ahr.H.hat, na.rm = TRUE), 3), "\n")
}
11 Summary
This vignette demonstrated the complete workflow for evaluating ForestSearch performance through simulation:
| Step | Function | Purpose |
|---|---|---|
| 1. Create DGM | `create_gbsg_dgm()` | Define data generating mechanism |
| 2. Calibrate | `calibrate_k_inter()` | Achieve target subgroup HR (Cox or AHR) |
| 3. Validate | `validate_k_inter_effect()` | Verify heterogeneity control |
| 4. Simulate | `simulate_from_gbsg_dgm()` | Generate trial data |
| 5. Analyze | `run_simulation_analysis()` | Run ForestSearch/GRF |
| 6. Summarize | `summarize_simulation_results()` | Aggregate metrics |
| 7. Display | `format_oc_results()` | Create gt tables |
Key metrics to report:
- Power (H1) / Type I Error (H0): Subgroup detection rate
- Sensitivity: P(identified | true harm subgroup)
- Specificity: P(not identified | true complement)
- PPV: P(true harm | identified)
- NPV: P(true complement | not identified)
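The four classification metrics can be sketched directly from per-subject indicators. Below is a minimal worked example with hypothetical vectors (`truth` marks the true harm subgroup H, `identified` marks the estimated subgroup; names are illustrative, not columns produced by the package):

```r
# Hypothetical membership indicators for 8 subjects
truth      <- c(1, 1, 1, 0, 0, 0, 0, 0)  # true harm subgroup H
identified <- c(1, 1, 0, 1, 0, 0, 0, 0)  # estimated subgroup

# Sensitivity: P(identified | true harm subgroup)
sens <- sum(identified == 1 & truth == 1) / sum(truth == 1)
# Specificity: P(not identified | true complement)
spec <- sum(identified == 0 & truth == 0) / sum(truth == 0)
# PPV: P(true harm | identified)
ppv  <- sum(truth == 1 & identified == 1) / sum(identified == 1)
# NPV: P(true complement | not identified)
npv  <- sum(truth == 0 & identified == 0) / sum(identified == 0)

round(c(sens = sens, spec = spec, ppv = ppv, npv = npv), 3)
#  sens  spec   ppv   npv
# 0.667 0.800 0.667 0.800
```

In the simulation summaries these per-trial values are then averaged across simulated datasets (with `na.rm = TRUE`, since sensitivity is undefined under the null where no true subgroup exists).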
New aligned features:
- AHR metrics: Alternative to Cox-based HR (from `loghr_po`)
- `use_ahr` calibration: Calibrate to AHR instead of Cox HR
- `use_twostage`: Faster two-stage search algorithm option
- Individual effects: Access `theta_0`, `theta_1`, `loghr_po` per subject
12 Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doFuture_1.2.0 future_1.69.0 foreach_1.5.2
[4] gt_1.3.0 ggplot2_4.0.1 survival_3.8-6
[7] data.table_1.18.0 weightedsurv_0.1.0 forestsearch_0.0.0.9000
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 xml2_1.4.0
[4] shape_1.4.6.1 stringi_1.8.7 lattice_0.22-7
[7] cubature_2.1.4-1 listenv_0.9.1 digest_0.6.37
[10] magrittr_2.0.4 grf_2.5.0 evaluate_1.0.5
[13] grid_4.5.1 RColorBrewer_1.1-3 iterators_1.0.14
[16] policytree_1.2.3 fastmap_1.2.0 jsonlite_2.0.0
[19] Matrix_1.7-3 glmnet_4.1-10 scales_1.4.0
[22] codetools_0.2-20 cli_3.6.5 rlang_1.1.6
[25] parallelly_1.46.1 future.apply_1.20.1 splines_4.5.1
[28] withr_3.0.2 yaml_2.3.10 tools_4.5.1
[31] parallel_4.5.1 dplyr_1.1.4 globals_0.18.0
[34] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
[37] stringr_1.6.0 randomForest_4.7-1.2 fs_1.6.6
[40] htmlwidgets_1.6.4 pkgconfig_2.0.3 progressr_0.18.0
[43] pillar_1.11.1 gtable_0.3.6 glue_1.8.0
[46] Rcpp_1.1.0 xfun_0.53 tibble_3.3.0
[49] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.51
[52] farver_2.1.2 htmltools_0.5.8.1 rmarkdown_2.30
[55] compiler_4.5.1 S7_0.2.0
13 References
León LF, Marceau-West CT, He W, et al. (2024). “Identifying Patient Subgroups with Differential Treatment Effects: A Forest Search Approach.” Statistics in Medicine.
Athey S, Imbens GW. (2016). “Recursive partitioning for heterogeneous causal effects.” PNAS, 113(27):7353-7360.
Wager S, Athey S. (2018). “Estimation and inference of heterogeneous treatment effects using random forests.” JASA, 113(523):1228-1242.